Current Issue: April - June; Volume: 2014; Issue Number: 2; Articles: 4
Prosody and prosodic boundaries carry significant linguistic and paralinguistic information and are important aspects of speech. In the field of prosodic event detection, many local acoustic features have been investigated; however, contextual information has not yet been thoroughly exploited. The most difficult aspect of this lies in learning the long-distance contextual dependencies effectively and efficiently. To address this problem, we introduce the use of an algorithm called auto-context. In this algorithm, a classifier is first trained on a set of local acoustic features, after which the generated probabilities are used along with the local features as contextual information to train new classifiers. By iteratively using updated probabilities as the contextual information, the algorithm can accurately model contextual dependencies and improve classification ability. The advantages of this method include its flexible structure and its ability to capture contextual relationships. When using the auto-context algorithm based on support vector machines, we can improve the detection accuracy by about 3% and the F-score by more than 7% on both two-way and four-way pitch accent detection in combination with the acoustic context. For boundary detection, the accuracy improvement is about 1% and the F-score improvement reaches 12%. The new algorithm outperforms conditional random fields, especially on boundary d...
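The iterative scheme described in this abstract can be sketched in a few lines. The following is a minimal illustration only, not the authors' implementation: a small logistic regression stands in for the support vector machine, and the function names, the context radius, the padding value of 0.5, and the toy data are all assumptions made for the example.

```python
import math

def predict_proba(w, x):
    """Probability of the positive class under a logistic model (last weight is the bias)."""
    z = sum(wj * xj for wj, xj in zip(w, x + [1.0]))
    z = max(-30.0, min(30.0, z))  # clamp so math.exp stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit a tiny logistic regression by batch gradient descent (stand-in for the SVM)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi + [1.0]):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def add_context(X, probs, radius):
    """Append the neighbours' current probabilities to each local feature vector."""
    out = []
    for i, xi in enumerate(X):
        ctx = []
        for off in range(-radius, radius + 1):
            if off == 0:
                continue
            j = i + off
            # Positions outside the sequence get an uninformative 0.5 (assumption).
            ctx.append(probs[j] if 0 <= j < len(X) else 0.5)
        out.append(xi + ctx)
    return out

def auto_context(X, y, rounds=2, radius=1):
    """Train on local features first, then retrain with neighbouring probabilities added."""
    w = train_logreg(X, y)
    probs = [predict_proba(w, x) for x in X]
    models = [w]
    for _ in range(rounds):
        Xc = add_context(X, probs, radius)
        w = train_logreg(Xc, y)
        probs = [predict_proba(w, x) for x in Xc]
        models.append(w)
    return models, probs

# Toy sequence: one local feature per position, binary labels (made-up data).
X = [[0.1], [0.2], [0.9], [0.8], [0.7], [0.2], [0.1], [0.9]]
y = [0, 0, 1, 1, 1, 0, 0, 1]
models, probs = auto_context(X, y, rounds=2, radius=1)
```

In this sketch every stage keeps its own weight vector, so classifying a new sequence would mean applying the stored models in the same order, recomputing the context probabilities after each stage.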
In this paper, we propose a novel noise-robustness method known as weighted sub-band histogram equalization (WS-HEQ) to improve speech recognition accuracy in noise-corrupted environments. Considering the observation that the high- and low-pass portions of the intra-frame cepstral features possess unequal importance for noise-corrupted speech recognition, WS-HEQ is intended to reduce the high-pass components of the cepstral features. Furthermore, we provide four types of WS-HEQ, which partially refer to the structure of spatial histogram equalization (S-HEQ). In the experiments conducted on the Aurora-2 noisy-digit database, the presented WS-HEQ yields significant recognition improvements relative to the Mel-scaled filter-bank cepstral coefficient (MFCC) baseline and to cepstral histogram normalization (CHN) in various noise-corrupted situations and exhibits a behavior superior to that of S-HEQ in most cases....
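For readers unfamiliar with histogram equalization of features, the basic operation underlying these methods can be sketched as follows. This shows only plain rank-based equalization of a single feature trajectory, not the weighted sub-band variant proposed in the paper; the standard normal reference distribution, the function name, and the sample values are assumptions made for the example.

```python
from statistics import NormalDist

def histogram_equalize(values):
    """Map each value through its empirical CDF onto a standard normal reference.

    `values` is assumed to be one cepstral coefficient tracked over the
    frames of an utterance. Ordering is preserved; only the distribution changes.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ref = NormalDist(0.0, 1.0)
    out = [0.0] * n
    for rank, i in enumerate(order):
        cdf = (rank + 0.5) / n  # mid-rank estimate, strictly inside (0, 1)
        out[i] = ref.inv_cdf(cdf)
    return out

# Made-up trajectory of one coefficient over five frames.
frames = [3.2, -1.0, 0.4, 7.5, 2.1]
equalized = histogram_equalize(frames)
```

Because the mapping depends only on ranks, a noise-induced shift or monotonic distortion of the feature values leaves the equalized output unchanged, which is the property such normalization schemes rely on.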
This paper investigates real-time N-dimensional wideband sound source localization in outdoor (far-field) and low-degree reverberation cases, using a simple N-microphone arrangement. Outdoor sound source localization in different climates needs highly sensitive and high-performance microphones, which are very expensive. Reduction of the microphone count is our goal. Time delay estimation (TDE)-based methods are common for N-dimensional wideband sound source localization in outdoor cases using at least N + 1 microphones. These methods need numerical analysis to solve closed-form non-linear equations, leading to large computational overheads, and a good initial guess to avoid local minima. Combined TDE and intensity level difference or interaural level difference (ILD) methods can reduce microphone counts to two for indoor two-dimensional cases. However, ILD-based methods need only one dominant source for accurate localization. Also, using a linear array, two mirror points are produced simultaneously (half-plane localization). We apply this method to outdoor cases and propose a novel approach for N-dimensional entire-space outdoor far-field and low reverberation localization of a dominant wideband sound source using TDE, ILD, and head-related transfer function (HRTF) simultaneously and only N microphones. Our proposed TDE-ILD-HRTF method tries to solve the mentioned problems using source counting, noise reduction using spectral subtraction, and HRTF. A special reflector is designed to avoid mirror points, and source counting is used to make sure that only one dominant source is active in the localization area. The simple microphone arrangement used leads to linearization of the non-linear closed-form equations and removes the need for an initial guess. Experimental results indicate that our implemented method features less than 0.2 degree error for angle of arrival and less than 10% error for three-dimensional location finding as well as less than 150-ms processing time for localization of a typical wideband sound source such as a flying object (helicopter)....
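The time delay estimation step that such methods build on can be illustrated with a generic two-microphone sketch. This is not the TDE-ILD-HRTF method of the paper: it uses a brute-force cross-correlation and the textbook far-field relation sin(theta) = c * tau / d, and the function names, the speed of sound of 343 m/s, and the sign convention for the angle are assumptions made for the example.

```python
import math

def estimate_delay(x, y, max_lag):
    """Return the lag in samples at which y best matches x (positive: y is delayed)."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(x)):
            m = n + lag
            if 0 <= m < len(y):
                score += x[n] * y[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def far_field_angle(lag, fs, spacing, c=343.0):
    """Angle of arrival in degrees from broadside for two microphones.

    lag is in samples, fs in Hz, spacing in metres, c in m/s (assumed value).
    """
    s = c * lag / (fs * spacing)
    s = max(-1.0, min(1.0, s))  # guard against lags beyond the physical limit
    return math.degrees(math.asin(s))

# Made-up pulse reaching the second microphone three samples later.
mic1 = [0.0] * 10 + [1.0, 2.0, 3.0, 2.0, 1.0] + [0.0] * 10
mic2 = [0.0] * 3 + mic1[:-3]
lag = estimate_delay(mic1, mic2, max_lag=5)
angle = far_field_angle(lag, fs=16000, spacing=0.2)
```

A single microphone pair only resolves the angle up to the mirror ambiguity mentioned in the abstract, which is the problem the reflector and the additional ILD and HRTF cues are meant to address.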
Affective computing, especially from speech, is one of the key steps toward building more natural and effective human-machine interaction. In recent years, several emotional speech corpora in different languages have been collected; however, Turkish is not among the languages that have been investigated in the context of emotion recognition. For this purpose, a new Turkish emotional speech database, which includes 5,100 utterances extracted from 55 Turkish movies, was constructed. Each utterance in the database is labeled with emotion categories (happy, surprised, sad, angry, fearful, neutral, and others) and three-dimensional emotional space (valence, activation, and dominance). We performed classification of four basic emotion classes (neutral, sad, happy, and angry) and estimation of emotion primitives using acoustic features. The importance of acoustic features in estimating the emotion primitive values and in classifying emotions into categories was also investigated. An unweighted average recall of 45.5% was obtained for the classification. For emotion dimension estimation, we obtained promising results for activation and dominance dimensions. For valence, however, the correlation between the averaged ratings of the evaluators and the estimates was low. The cross-corpus training and testing also showed good results for activation and dominance dimensions....
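Unweighted average recall, the metric reported here, is the mean of the per-class recalls, so each emotion class counts equally regardless of how many utterances it has. A minimal sketch of the computation follows; the function name and the eight toy labels are made up for illustration and are not taken from the database.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Toy, imbalanced example with the four classes named in the abstract.
y_true = ["neutral"] * 4 + ["sad"] * 2 + ["happy", "angry"]
y_pred = ["neutral"] * 4 + ["sad", "neutral"] + ["neutral", "angry"]
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

On this toy data the plain accuracy is 0.75 while the unweighted average recall is 0.625, because the fully missed "happy" class pulls the average down even though it contains a single utterance. With four balanced classes, chance level for this metric is 25%.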